We'll start with raw data and work through preparing, modeling, visualizing, and analyzing it. We'll touch on the following points:
- Text processing and annotation with spaCy
- Automated phrase modeling and topic modeling (LDA) with gensim
- Visualizing topic models with pyLDAvis
- Word vector modeling with word2vec
- Visualizing word vectors with t-SNE and Bokeh
Here's how to get the dataset:
When focusing on restaurants alone, there are approximately 41K restaurants with approximately 1M user reviews written about them.
The data is provided in a handful of files in .json format. We'll be using the following files for our demo:
- yelp_academic_dataset_business.json, which contains the business records
- yelp_academic_dataset_review.json, which contains the review records
The files are text files (UTF-8) with one json object per line, each one corresponding to an individual data record. Let's take a look at a few examples.
In [1]:
import os
data_directory = os.path.join('material', 'yelp', 'source')
businesses_filepath = os.path.join(data_directory, 'yelp_academic_dataset_business.json')
with open(businesses_filepath, encoding='utf_8') as f:
    first_business_record = f.readline()

print(first_business_record)
Only a few attributes will be of interest for this task:
- business_id: a unique identifier for each business
- categories: an array of category tags that apply to the business
Moreover, we will focus on restaurants, indicated by the presence of the Restaurants tag in the categories array.
The review records are stored in a similar manner — key, value pairs containing information about the reviews.
In [2]:
review_json_filepath = os.path.join(data_directory, 'yelp_academic_dataset_review.json')
with open(review_json_filepath, encoding='utf_8') as f:
    first_review_record = f.readline()

print(first_review_record)
The only attributes we are concerned with are:
- business_id: the identifier of the business the review is about
- text: the natural language text of the review
JSON is a handy file format for data interchange, but it's typically not the most usable format for modeling work. Let's do a bit more data preparation to get our data into a more usable shape. Our next code block will do the following:
- read each business record and convert it from a json string to a Python dict
- filter out businesses that aren't restaurants
- collect the business IDs for restaurants into a frozenset, which we'll use in the next step
In [3]:
import json
from numpy.random import choice
restaurant_ids = set()
# open the businesses file
with open(businesses_filepath, encoding='utf_8') as f:
    # iterate through each line (json record) in the file
    for business_json in f:
        # convert the json record to a Python dict
        business = json.loads(business_json)
        # if this business is not a restaurant, skip to the next one
        if business['categories'] is None or 'Restaurants' not in business['categories']:
            continue
        # add the restaurant business id to our restaurant_ids set
        restaurant_ids.add(business['business_id'])

# choose a subset of the restaurant ids
subset = []
subset_size = 20000
for restaurant_id, _ in zip(restaurant_ids, range(subset_size)):
    subset.append(restaurant_id)

# turn restaurant_ids into a frozenset, as we don't need to change it anymore
restaurant_ids = frozenset(subset)

# print the number of unique restaurant ids in the dataset
print('{} restaurants in the dataset'.format(len(restaurant_ids)))
Next, we will create a new file that contains only the text from reviews about restaurants, with one review per line in the file.
In [4]:
intermediate_directory = os.path.join('material', 'yelp', 'intermediate_results')
review_txt_filepath = os.path.join(intermediate_directory, 'review_text_all.txt')
In [5]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if False:
    review_count = 0

    # create & open a new file in write mode
    with open(review_txt_filepath, 'w', encoding='utf_8') as review_txt_file:
        # open the existing review json file
        with open(review_json_filepath, encoding='utf_8') as review_json_file:
            # loop through all reviews in the existing file and convert to dict
            for review_json in review_json_file:
                review = json.loads(review_json)
                # if this review is not about a restaurant, skip to the next one
                if review['business_id'] not in restaurant_ids:
                    continue
                # write the restaurant review as a line in the new file
                # escape newline characters in the original review text
                review_txt_file.write(review['text'].replace('\n', '\\n') + '\n')
                review_count += 1

    print('Text from {} restaurant reviews written to the new txt file.'.format(review_count))

else:
    with open(review_txt_filepath, encoding='utf_8') as review_txt_file:
        for review_count, line in enumerate(review_txt_file):
            pass

    print('Text from {} restaurant reviews in the txt file.'.format(review_count + 1))
spaCy is a natural language processing (NLP) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.
spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline:
- Tokenization
- Text normalization, such as lowercasing and lemmatization
- Part-of-speech tagging
- Syntactic dependency parsing
- Sentence boundary detection
- Named entity recognition and annotation
spaCy is written in optimized Cython, which means it's fast.
In [6]:
import spacy
import pandas as pd
import itertools as it
nlp = spacy.load('en')
Let's grab a sample review to play with.
In [7]:
with open(review_txt_filepath, encoding='utf_8') as f:
    sample_review = list(it.islice(f, 2, 3))[0]
    sample_review = sample_review.replace('\\n', '\n')

print(sample_review)
Handing the review text to spaCy.
In [8]:
%%time
parsed_review = nlp(sample_review)
In [9]:
print(parsed_review)
Looks the same! What happened under the hood?
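Although the printed text looks identical, parsed_review is no longer a plain string: it's a spaCy Doc object that carries token-level annotations. A quick sanity check (the exact output depends on your spaCy version):

print(type(parsed_review))   # e.g., <class 'spacy.tokens.doc.Doc'>
print(len(parsed_review))    # number of tokens spaCy found in the review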
What about sentence detection and segmentation?
In [10]:
for num, sentence in enumerate(parsed_review.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print('')
What about named entity detection?
In [11]:
for num, entity in enumerate(parsed_review.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print('')
What about part of speech tagging?
In [12]:
token_text = [token.orth_ for token in parsed_review]
token_pos = [token.pos_ for token in parsed_review]
pd.DataFrame(list(zip(token_text, token_pos)), columns=['token_text', 'part_of_speech'])
Out[12]:
There is much more, like:
- the token's log probability (how common it is in spaCy's training corpus)
- whether the token is a stopword
- whether it is punctuation or whitespace
- whether it looks like a number
- whether it is out of vocabulary
In [13]:
token_attributes = [(token.orth_,
                     token.prob,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_review]

df = pd.DataFrame(token_attributes,
                  columns=['text',
                           'log_probability',
                           'stop?',
                           'punctuation?',
                           'whitespace?',
                           'number?',
                           'out of vocab.?'])

df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']
                                       .applymap(lambda x: 'Yes' if x else ''))

df
Out[13]:
Phrase modeling is an approach to learning combinations of tokens that together represent meaningful multi-word concepts.
We can develop phrase models by looping over the words in our reviews and looking for words that co-occur (i.e., appear one after another) much more frequently than you would expect by random chance.
The formula our phrase models will use to determine whether two tokens $A$ and $B$ constitute a phrase is:

$$\frac{count(A\,B) - count_{min}}{count(A) \times count(B)} \times N > threshold$$

...where:
- $count(A)$ is the number of times token $A$ appears in the corpus
- $count(B)$ is the number of times token $B$ appears in the corpus
- $count(A\,B)$ is the number of times the tokens $A\,B$ appear in the corpus in order
- $N$ is the total size of the corpus vocabulary
- $count_{min}$ is a user-defined parameter to ensure that accepted phrases occur a minimum number of times
- $threshold$ is a user-defined parameter to control how strong of a relationship between two tokens the model requires before accepting them as a phrase
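To make the scoring rule concrete, here's a quick calculation with made-up counts (hypothetical numbers, not taken from the Yelp corpus; the min_count and threshold values shown are gensim's defaults):

# hypothetical counts for the candidate phrase "new york"
count_a = 1000        # count(A): occurrences of "new"
count_b = 800         # count(B): occurrences of "york"
count_ab = 150        # count(A B): occurrences of "new york" as consecutive tokens
vocab_size = 100000   # N: number of distinct tokens in the corpus
min_count = 5         # gensim's default min_count
threshold = 10.0      # gensim's default threshold

score = (count_ab - min_count) / (count_a * count_b) * vocab_size
print(score, score > threshold)   # 18.125 True -> "new york" would become "new_york"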
Once our phrase model has been trained on our corpus, we can apply it to new text.
When our model encounters two tokens in new text that it identifies as a phrase, it will merge the two into a single new token (so new york would become new_york).
We will use the gensim library to help us with phrase modeling.
In [14]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence
from gensim.models.phrases import Phraser
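As a quick, self-contained illustration of what a trained phrase model does, here's a toy sketch with made-up sentences (not our Yelp data); the min_count and threshold values are chosen artificially low just so the tiny corpus produces a phrase:

from gensim.models import Phrases
from gensim.models.phrases import Phraser

# a tiny made-up corpus in which "new" and "york" always appear together
toy_sentences = [['new', 'york']] * 30 + [['pizza'], ['pasta'], ['salad']] * 5

toy_model = Phrases(toy_sentences, min_count=1, threshold=0.1)
toy_phraser = Phraser(toy_model)

# the trained phraser merges the detected phrase into a single token
print(toy_phraser[['new', 'york', 'pizza', 'slice']])
# expected output (with these counts): ['new_york', 'pizza', 'slice']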
As we're performing phrase modeling, we'll be doing some iterative data transformation at the same time:
- segment the text of the complete reviews into sentences and normalize (lemmatize) the text
- train a first-order phrase model and apply it to join common word pairs into single bigram tokens
- train a second-order phrase model on top of that and apply it to join longer phrases
- apply the full normalization and phrase pipeline to the complete review texts, removing stopwords along the way
We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.
First, let's define a few helper functions that we'll use for text normalization. In particular, the lemmatized_sentence_corpus generator function will use spaCy to segment the reviews into individual sentences, lemmatize the text, and yield one sentence at a time, and it will do so efficiently in parallel thanks to spaCy's nlp.pipe() function.
In [15]:
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    for parsed_review in nlp.pipe(line_review(filename), batch_size=10000, n_threads=4):
        for sent in parsed_review.sents:
            yield ' '.join([token.lemma_ for token in sent if not punct_space(token)])


def line_review(filename):
    """
    generator function to read in reviews from the file
    and un-escape the original line breaks in the text
    """
    with open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')


def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    return token.is_punct or token.is_space
In [16]:
unigram_sentences_filepath = os.path.join(intermediate_directory, 'unigram_sentences_all.txt')
We'll write the data back out to a new file (unigram_sentences_all), with one normalized sentence per line. We'll use this data for learning our phrase models.
In [17]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if False:
    with open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for sentence in lemmatized_sentence_corpus(review_txt_filepath):
            f.write(sentence + '\n')
If your data is organized like this, with one sentence per line, gensim's LineSentence class provides a convenient iterator for working with other gensim components.
It streams the documents/sentences from disk, so that you never have to hold the entire corpus in RAM at once.
This allows you to scale your modeling pipeline up to potentially very large corpora.
In [18]:
unigram_sentences = LineSentence(unigram_sentences_filepath)
Let's take a look at a few sample sentences in our new, transformed file.
In [19]:
for unigram_sentence in it.islice(unigram_sentences, 100, 110):
    print(' '.join(unigram_sentence))
    print('---')
Next, we'll learn a phrase model that will link individual words into two-word phrases. We'd expect words that together represent a specific concept, like "rib eye", to be linked together to form a new, single token: "rib_eye".
In [20]:
bigram_model_filepath = os.path.join(intermediate_directory, 'bigram_model_all')
In [21]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if False:
    bigram_model = Phrases(unigram_sentences)
    bigram_model.save(bigram_model_filepath)

# load the finished model from disk
bigram_model = Phrases.load(bigram_model_filepath)

# a Phraser is smaller and faster than the full Phrases model
bigram_phraser = Phraser(bigram_model)
Now that we have a trained phrase model for word pairs, let's apply it to the review sentences data and explore the results.
In [22]:
bigram_sentences_filepath = os.path.join(intermediate_directory, 'bigram_sentences_all.txt')
In [23]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if False:
    with open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for unigram_sentence in unigram_sentences:
            bigram_sentence = ' '.join(bigram_phraser[unigram_sentence])
            f.write(bigram_sentence + '\n')
In [24]:
bigram_sentences = LineSentence(bigram_sentences_filepath)
In [25]:
for bigram_sentence in it.islice(bigram_sentences, 1480, 1500):
    print(' '.join(bigram_sentence))
    print('---')
We now see two-word phrases, such as "ice_cream" and "french_toast", linked together in the text as a single token.
Next, we'll train a second-order phrase model. We'll apply the second-order phrase model on top of the already-transformed data, so that incomplete word combinations like "rib eye steak" will become fully joined to "rib_eye_steak".
In [26]:
trigram_model_filepath = os.path.join(intermediate_directory, 'trigram_model_all')
In [27]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if False:
    trigram_model = Phrases(bigram_sentences)
    trigram_model.save(trigram_model_filepath)

# load the finished model from disk
trigram_model = Phrases.load(trigram_model_filepath)

# a Phraser is smaller and faster than the full Phrases model
trigram_phraser = Phraser(trigram_model)
We'll apply our trained second-order phrase model to our first-order transformed sentences, write the results out to a new file, and explore a few of the second-order transformed sentences.
In [28]:
trigram_sentences_filepath = os.path.join(intermediate_directory, 'trigram_sentences_all.txt')
In [29]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if False:
    with open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for bigram_sentence in bigram_sentences:
            trigram_sentence = ' '.join(trigram_phraser[bigram_sentence])
            f.write(trigram_sentence + '\n')
In [30]:
trigram_sentences = LineSentence(trigram_sentences_filepath)
In [31]:
for trigram_sentence in it.islice(trigram_sentences, 1480, 1500):
    print(' '.join(trigram_sentence))
    print('---')
Looks like the second-order phrase model was successful. We're now seeing three-word phrases, such as "rib_eye_steak".
The final step of our text preparation process circles back to the complete text of the reviews. We're going to run the complete text of the reviews through a pipeline that applies our text normalization and phrase models.
In addition, we'll remove stopwords at this point.
Finally, we'll write the transformed text out to a new file, with one review per line.
In [32]:
trigram_reviews_filepath = os.path.join(intermediate_directory, 'trigram_transformed_reviews_all.txt')
In [33]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if False:
    with open(trigram_reviews_filepath, 'w', encoding='utf_8') as f:
        for parsed_review in nlp.pipe(line_review(review_txt_filepath),
                                      batch_size=10000, n_threads=4):
            # lemmatize the text, removing punctuation and whitespace
            unigram_review = [token.lemma_ for token in parsed_review
                              if not punct_space(token)]

            # apply the first-order and second-order phrase models
            bigram_review = bigram_phraser[unigram_review]
            trigram_review = trigram_phraser[bigram_review]

            # remove any remaining stopwords
            trigram_review = [term for term in trigram_review
                              if term not in spacy.en.STOPWORDS]

            # write the transformed review as a line in the new file
            trigram_review = ' '.join(trigram_review)
            f.write(trigram_review + '\n')
Let's grab the same review from the file with the normalized and transformed text, and compare the two.
In [34]:
print('Original:\n')

for review in it.islice(line_review(review_txt_filepath), 49, 50):
    print(review)

print('----\n')
print('Transformed:\n')

with open(trigram_reviews_filepath, encoding='utf_8') as f:
    for review in it.islice(f, 49, 50):
        print(review)
Most of the grammatical structure has been removed from the text — capitalization, articles/conjunctions, punctuation, spacing, etc.
However, much of the general semantic meaning is still present.
Also, multi-word concepts such as "long_story_short" and "45_min" have been joined into single tokens, as expected.
The review text is now ready for higher-level modeling.
Topic modeling is a family of techniques that can be used to describe and summarize the documents in a corpus according to a set of latent "topics". For this demo, we'll be using Latent Dirichlet Allocation (LDA), a popular approach to topic modeling.
In many conventional NLP applications, documents are represented as a mixture of the individual tokens (words and phrases) they contain. In other words, a document is represented as a vector of token counts. There are two layers in this model, documents and tokens, and the size or dimensionality of the document vectors is the number of tokens in the corpus vocabulary. This approach has a number of disadvantages:
- The vectors are very high-dimensional, since there is one dimension for every token in the vocabulary.
- They are also very sparse, because any single document contains only a tiny fraction of the vocabulary.
- The dimensions are treated as independent, so the representation captures no notion of the relationship between related tokens, such as knife and fork.
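To make the token-count representation concrete, here's a toy sketch using gensim's Dictionary and doc2bow (a made-up two-document mini-corpus, not our Yelp data); we'll use the same machinery for real further below:

from gensim.corpora import Dictionary

toy_docs = [['the', 'steak', 'was', 'great'],
            ['the', 'service', 'was', 'slow']]

toy_dictionary = Dictionary(toy_docs)

print(len(toy_dictionary))                  # vocabulary size = dimensionality of each document vector
print(toy_dictionary.doc2bow(toy_docs[0]))  # sparse list of (token_id, count) pairs for the first document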
LDA injects a third layer into this conceptual model. Documents are represented as a mixture of a pre-defined number of topics, and the topics are represented as a mixture of the individual tokens in the vocabulary. The number of topics is a model hyperparameter selected by the practitioner. LDA makes a prior assumption that the (document, topic) and (topic, token) mixtures follow Dirichlet probability distributions. This assumption encourages documents to consist mostly of a handful of topics, and topics to consist mostly of a modest set of tokens.
LDA is fully unsupervised. The topics are "discovered" automatically from the data by trying to maximize the likelihood of observing the documents in your corpus, given the modeling assumptions. They are expected to capture some latent structure and organization within the documents, and often have a meaningful human interpretation for people familiar with the subject material.
We'll again turn to gensim to assist with data preparation and modeling. In particular, gensim offers a high-performance parallelized implementation of LDA with its LdaMulticore class.
In [35]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore
import pyLDAvis
import pyLDAvis.gensim
import warnings
import _pickle as pickle
The first step to creating an LDA model is to learn the full vocabulary of the corpus to be modeled. We'll use gensim's Dictionary class for this.
In [36]:
trigram_dictionary_filepath = os.path.join(intermediate_directory, 'trigram_dict_all.dict')
In [37]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to learn the dictionary yourself.
if False:
    trigram_reviews = LineSentence(trigram_reviews_filepath)

    # learn the dictionary by iterating over all of the reviews
    trigram_dictionary = Dictionary(trigram_reviews)

    # filter tokens that are very rare or too common from
    # the dictionary (filter_extremes) and reassign integer ids (compactify)
    trigram_dictionary.filter_extremes(no_below=10, no_above=0.4)
    trigram_dictionary.compactify()

    trigram_dictionary.save(trigram_dictionary_filepath)

# load the finished dictionary from disk
trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)
LDA uses the simplifying bag-of-words assumption: word order within a document is ignored, and only the token counts matter. Using the gensim Dictionary we just learned, we can generate a bag-of-words representation for each review; the trigram_bow_generator function below implements this. We'll save the resulting bag-of-words reviews as a matrix.
In [38]:
trigram_bow_filepath = os.path.join(intermediate_directory, 'trigram_bow_corpus_all.mm')
In [39]:
def trigram_bow_generator(filepath):
    """
    generator function to read reviews from a file
    and yield a bag-of-words representation
    """
    for review in LineSentence(filepath):
        yield trigram_dictionary.doc2bow(review)
In [40]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to build the bag-of-words corpus yourself.
if False:
    # generate bag-of-words representations for
    # all reviews and save them as a matrix
    MmCorpus.serialize(trigram_bow_filepath,
                       trigram_bow_generator(trigram_reviews_filepath))

# load the finished bag-of-words corpus from disk
trigram_bow_corpus = MmCorpus(trigram_bow_filepath)
Now we can learn our topic model from the reviews.
We simply need to pass the bag-of-words matrix and Dictionary from our previous steps to the LdaMulticore model.
In [41]:
lda_model_filepath = os.path.join(intermediate_directory, 'lda_model_all')
In [42]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to train the LDA model yourself.
if False:
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')

        # workers => sets the parallelism, and should be
        # set to your number of physical cores minus one
        lda = LdaMulticore(trigram_bow_corpus,
                           num_topics=50,
                           id2word=trigram_dictionary,
                           workers=4)

    lda.save(lda_model_filepath)

# load the finished LDA model from disk
lda = LdaMulticore.load(lda_model_filepath)
Since each topic is represented as a mixture of tokens, you can manually inspect which tokens have been grouped together into which topics to try to understand the patterns the model has discovered in the data.
In [43]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
    print('{:20} {}\n'.format('term', 'frequency'))

    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print('{:20} {:.3f}'.format(term, round(frequency, 3)))
Interesting topics are:
0, 1, 8, 10, 11, 15, 17, 21, 23, 24, 28, 32, 35, 40, 42, 46, 48
In [44]:
explore_topic(topic_number=0)
Manually reviewing the top terms for each topic is a helpful exercise, but to get a deeper understanding of the topics and how they relate to each other, we need to visualize the data — preferably in an interactive format. Fortunately, we have the fantastic pyLDAvis library to help with that!
pyLDAvis includes a one-line function to take topic models created with gensim and prepare their data for visualization.
In [45]:
LDAvis_data_filepath = os.path.join(intermediate_directory, 'ldavis_prepared')
In [46]:
#%%time
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if False:
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda, trigram_bow_corpus, trigram_dictionary)

    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)

# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)
pyLDAvis.display(...) displays the topic model visualization in-line in the notebook.
In [47]:
pyLDAvis.display(LDAvis_prepared)
Out[47]:
There are a lot of moving parts in the visualization. Here's a brief summary:
- On the left is the Intertopic Distance Map. Each circle is a topic; a circle's area is proportional to the topic's overall prevalence in the corpus, and similar topics tend to appear close together (inter-topic distances are projected down to two dimensions for plotting).
- On the right is a bar chart of top terms. With no topic selected, it shows the most salient terms in the corpus overall; with a topic selected, it shows that topic's top terms, with the within-topic frequency overlaid on the overall corpus frequency.
- The $\lambda$ slider controls how the terms for the selected topic are ranked, trading off a term's raw frequency within the topic against how distinctive the term is to that topic.
The interactive visualization pyLDAvis produces is helpful for both:
1. better understanding and interpreting individual topics, and
2. better understanding the relationships between the topics.
For (1), you can manually select each topic to view its top most frequent and/or "relevant" terms, using different values of the $\lambda$ parameter. This can help when you're trying to assign a human-interpretable name or "meaning" to each topic.
For (2), exploring the Intertopic Distance Plot can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.
Beyond data exploration, one of the key uses for an LDA model is providing a compact, quantitative description of natural language text. Once an LDA model has been trained, it can be used to represent free text as a mixture of the topics the model learned from the original corpus. This mixture can be interpreted as a probability distribution across the topics, so the LDA representation of a paragraph of text might look like 50% Topic A, 20% Topic B, 20% Topic C, and 10% Topic D.
To use an LDA model to generate a vector representation of new text, you'll need to apply the same text preprocessing steps you used on the model's training corpus to the new text, too. For our model, the preprocessing steps we used include:
- parsing the text with spaCy, removing punctuation and whitespace, and lemmatizing the remaining tokens
- applying the first-order and second-order phrase models to join multi-word phrases
- removing stopwords
- creating a bag-of-words representation with the gensim Dictionary
Once you've applied these preprocessing steps to the new text, it's ready to pass directly to the model to create an LDA representation. The lda_description(...) function will perform all these steps for us, including printing the resulting topical description of the input text.
In [54]:
def get_sample_review(review_number):
    """
    retrieve a particular review index
    from the reviews file and return it
    """
    return list(it.islice(line_review(review_txt_filepath),
                          review_number, review_number + 1))[0]
In [71]:
def lda_description(review_text, min_topic_freq=0.05):
    """
    accept the original text of a review and (1) parse it with spaCy,
    (2) apply text pre-processing steps, (3) create a bag-of-words
    representation, (4) create an LDA representation, and
    (5) print a sorted list of the top topics in the LDA representation
    """
    # parse the review text with spaCy
    parsed_review = nlp(review_text)

    # lemmatize the text and remove punctuation and whitespace
    unigram_review = [token.lemma_ for token in parsed_review
                      if not punct_space(token)]

    # apply the first-order and second-order phrase models
    bigram_review = bigram_phraser[unigram_review]
    trigram_review = trigram_phraser[bigram_review]

    # remove any remaining stopwords
    trigram_review = [term for term in trigram_review
                      if term not in spacy.en.STOPWORDS]

    # create a bag-of-words representation
    review_bow = trigram_dictionary.doc2bow(trigram_review)

    # create an LDA representation
    review_lda = lda[review_bow]

    # sort with the most highly related topics first
    review_lda = sorted(review_lda, key=lambda x: -x[1])

    for topic_number, freq in review_lda:
        if freq < min_topic_freq:
            break

        # print the top terms for each of the highly related topics
        print('{}'.format([word for word, frq in lda.show_topic(topic_number, topn=10)]))
In [81]:
sample_review = get_sample_review(256)
print(sample_review)
In [82]:
lda_description(sample_review)
Can you complete this text snippet? If you can guess a missing word from the words that surround it, you've just demonstrated the core machine learning concept behind word vector embedding models!
Word vector models are also fully unsupervised: they learn these meanings and relationships purely by analyzing large amounts of raw, unlabeled text.
The general idea of word2vec is, for a given focus word, to use the context of the word, i.e., the other words that appear immediately before and after it, to learn a vector representation for that focus word. Words that appear in similar contexts wind up with similar vectors.
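As a rough sketch of the kind of (focus word, context word) training pairs a skip-gram word2vec model consumes (a toy illustration only; gensim's Word2Vec generates these internally, and the window size here is an arbitrary choice):

def skipgram_pairs(tokens, window=2):
    """return (focus, context) pairs for every token in a tokenized sentence"""
    pairs = []
    for i, focus in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((focus, tokens[j]))
    return pairs

print(skipgram_pairs(['the', 'grilled', 'cheese', 'was', 'amazing']))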
Word2vec has a number of user-defined hyperparameters, including:
- size: the dimensionality of the word vectors
- window: how many words before and after the focus word count as its context
- min_count: the minimum number of occurrences a term needs in order to be included in the vocabulary
- sg: whether to use the skip-gram (sg=1) or continuous bag-of-words (sg=0) formulation
- the number of training epochs
For using word2vec in Python, gensim comes to the rescue again! It offers a highly-optimized, parallelized implementation of the word2vec algorithm with its Word2Vec class.
In [83]:
from gensim.models import Word2Vec
trigram_sentences = LineSentence(trigram_sentences_filepath)
word2vec_filepath = os.path.join(intermediate_directory, 'word2vec_model_all')
We'll train our word2vec model using the normalized sentences with our phrase models applied. We'll use 100-dimensional vectors, and set up our training process to run for twelve epochs.
In [84]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to train the word2vec model yourself.
if False:
    # initiate the model and perform the first epoch of training
    food2vec = Word2Vec(trigram_sentences, size=100, window=5,
                        min_count=20, sg=1, workers=4)
    food2vec.save(word2vec_filepath)

    # perform another 11 epochs of training
    for i in range(1, 12):
        food2vec.train(trigram_sentences)
        food2vec.save(word2vec_filepath)

# load the finished model from disk
food2vec = Word2Vec.load(word2vec_filepath)
food2vec.init_sims()

print('{} training epochs so far.'.format(food2vec.train_count))
In [85]:
print('{} terms in the food2vec vocabulary.'.format(len(food2vec.wv.vocab)))
Let's take a look at the word vectors our model has learned.
In [87]:
# build a list of the terms, integer indices,
# and term counts from the food2vec model vocabulary
ordered_vocab = [(term, voc.index, voc.count)
                 for term, voc in food2vec.wv.vocab.items()]

# sort by the term counts, so the most common terms appear first
ordered_vocab = sorted(ordered_vocab, key=lambda x: -x[2])

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a DataFrame with the food2vec vectors as data,
# and the terms as row labels
word_vectors = pd.DataFrame(food2vec.wv.syn0norm[term_indices, :],
                            index=ordered_terms)
word_vectors
Out[87]:
What is the size of the wall of numbers?
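One way to answer that directly is to check the DataFrame's shape: the number of rows is the vocabulary size, and the number of columns is the 100 vector dimensions we chose.

print(word_vectors.shape)   # (number of terms in the food2vec vocabulary, 100)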
In [92]:
def get_related_terms(token, topn=10):
    """
    look up the topn most similar terms to token
    and print them as a formatted list
    """
    for word, similarity in food2vec.most_similar(positive=[token], topn=topn):
        print('{:20} {}'.format(word, round(similarity, 3)))
In [99]:
get_related_terms('burger_king')
The model has learned that fast food restaurants are similar to each other! In particular, mcdonalds and wendy's are the most similar to Burger King, according to this dataset. In addition, the model has found that alternate spellings for the same entities are probably related, such as mcdonalds, mcdonald's and mcd's.
In [100]:
get_related_terms('soccer')
In [101]:
get_related_terms('fork')
In [103]:
get_related_terms('apple')
In [61]:
get_related_terms('happy_hour')
In [107]:
get_related_terms('tip')
In [62]:
get_related_terms('pasta', topn=20)
The core idea is that once words are represented as numerical vectors, you can do math with them. The mathematical procedure goes like this:
1. Take the vector for each word in the add list and add them together, subtracting the vectors for any words in the subtract list.
2. Find the word vectors in the vocabulary that are closest (by cosine similarity) to the resulting combined vector.
3. Return the words associated with those closest vectors.
But more generally, you can think of the vectors that represent each word as encoding some information about the meaning or concepts of the word. What happens when you ask the model to combine the meaning and concepts of words in new ways? Let's see.
In [109]:
def word_algebra(add=[], subtract=[], topn=1):
    """
    combine the vectors associated with the words provided
    in add= and subtract=, look up the topn most similar
    terms to the combined vector, and print the result(s)
    """
    answers = food2vec.most_similar(positive=add, negative=subtract, topn=topn)

    for term, similarity in answers:
        print(term)
In [110]:
word_algebra(add=['breakfast', 'lunch'])
In [111]:
word_algebra(add=['lunch', 'night'], subtract=['day'])
In [112]:
word_algebra(add=['taco', 'chinese'], subtract=['mexican'])
In [113]:
word_algebra(add=['bun', 'mexican'], subtract=['american'])
In [114]:
word_algebra(add=['coffee', 'snack'], subtract=['drink'])
In [115]:
word_algebra(add=['burger_king', 'pizza'])
In [118]:
word_algebra(add=['wine', 'hops'], subtract=['grapes'])
scikit-learn provides a convenient implementation of the t-SNE algorithm with its TSNE class.
In [120]:
from sklearn.manifold import TSNE
Our input for t-SNE will be the DataFrame of word vectors we created before. Let's first:
- drop any stopwords that made it into the vocabulary, and
- keep only the 5,000 most frequent terms, to keep the t-SNE run manageable.
In [121]:
tsne_input = word_vectors.drop(spacy.en.STOPWORDS, errors='ignore')
tsne_input = tsne_input.head(5000)
In [123]:
tsne_input.head(10)
Out[123]:
In [124]:
tsne_filepath = os.path.join(intermediate_directory, 'tsne_model')
tsne_vectors_filepath = os.path.join(intermediate_directory, 'tsne_vectors.npy')
In [128]:
%%time
if False:
    tsne = TSNE()
    tsne_vectors = tsne.fit_transform(tsne_input.values)

    with open(tsne_filepath, 'wb') as f:
        pickle.dump(tsne, f)

    pd.np.save(tsne_vectors_filepath, tsne_vectors)

with open(tsne_filepath, 'rb') as f:
    tsne = pickle.load(f)

tsne_vectors = pd.np.load(tsne_vectors_filepath)

tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(tsne_input.index),
                            columns=['x_coord', 'y_coord'])
Now we have a two-dimensional representation of our data! Let's take a look.
In [129]:
tsne_vectors.head()
Out[129]:
In [130]:
tsne_vectors['word'] = tsne_vectors.index
In [134]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value
output_notebook()
In [135]:
# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vectors)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(title='t-SNE Word Embeddings',
                   plot_width=800,
                   plot_height=800,
                   tools=('pan, wheel_zoom, box_zoom,'
                          'box_select, resize, reset'),
                   active_scroll='wheel_zoom')

# add a hover tool to display words on roll-over
tsne_plot.add_tools(HoverTool(tooltips='@word'))

# draw the words as circles on the plot
tsne_plot.circle('x_coord', 'y_coord', source=plot_data,
                 color='blue', line_alpha=0.2, fill_alpha=0.1,
                 size=10, hover_line_color='black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value('16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot);
Whew! Let's round up the major components that we've seen:
- Text processing and annotation with spaCy
- Phrase modeling and topic modeling (LDA) with gensim
- Interactive topic model visualization with pyLDAvis
- Word vector modeling with word2vec
- Dimensionality reduction with t-SNE and interactive plotting with Bokeh
Dense vector representations for text like LDA and word2vec can greatly improve performance for a number of common, text-heavy problems like:
- search and information retrieval
- recommendation systems
- text classification (e.g., sentiment analysis)
- document similarity and duplicate detection
...and more generally are a powerful way machines can help humans make sense of what's in a giant pile of text. They're also often useful as a pre-processing step for many other downstream machine learning applications.